In the information age, the number of Bangla news articles on the internet is growing fast. Every
news site organizes its articles with its own structure and categorization. News article
classification assigns a document to one of several predefined categories. This research discusses
the classification of Bangla news articles on online platforms and makes a constructive comparison
of classification algorithms. For Bangla news article classification, term frequency-inverse
document frequency (TF-IDF) weighting and a count vectorizer are used for feature extraction, and
two common classifiers, support vector machine (SVM) and logistic regression (LR), are employed to
classify the documents. In the experiments, SVM achieves an accuracy of 84.0% and LR 81.0% on
twelve categories of news articles. In this comparison of two well-known classification algorithms
applied to Bangla news articles, SVM outperformed LR.

1. INTRODUCTION

Text data is the most comprehensive source of information, but because of its unstructured nature it
is difficult and time-consuming to draw insights from it. Advances in machine learning (ML) and
natural language processing (NLP) are making it easier to analyze text data through text
classification. Text classification techniques are used to organize, structure, and categorize text
data for sentiment analysis, topic labelling, spam detection, intent detection, and so on [1]. This
study applies text classification techniques to Bangla news articles, as news articles are the most
common form of text online. Many studies have been conducted on text classification, and different
classification techniques such as rule-based methods, decision trees, k-nearest neighbors (KNN),
naive Bayes, logistic regression, support vector machines (SVM), neural networks (NN), and so on
have been developed [2]. The literature shows that researchers have focused on classifying texts in
languages such as English and Arabic [3], [4]; however, far less work has been done on the Bengali
language. The authors of [5] used machine learning algorithms to categorize Bangla newspaper
articles into five distinct groups. They used logistic regression, SVM, and a multi-layer neural
network to do so. The multi-layer dense neural network approach has an accuracy of 95.50%, which is
higher than that of the other models. The behavior of least-squares SVM, twin SVM, and least-squares
twin SVM (LS-TWSVM) classifiers on multi-category news data is presented in [6]. The performance
evaluation showed that LS-TWSVM is the best of the three, with 92.96% accuracy. Maisha et al. [7]
perform sentiment analysis on Bangla news using a "pipeline" class along with six state-of-the-art
supervised ML algorithms: decision tree (DT), multinomial naive Bayes (MNB), k-nearest neighbor
(KNN), logistic regression (LR), random forest (RF), and Lagrangian support vector machine (LSVM).
The random forest algorithm outperforms all the other algorithms, securing 98% accuracy with the
percentage split method. This methodology categorizes the content as either positive or negative.
Rabbimov and Kobilov [8] used SVM, a decision tree classifier, random forest, LR, and MNB, among
six machine learning algorithms, to conduct multi-class text categorization of internet Uzbek news
articles. For SVM, they used a radial basis function kernel (RBF SVM), which gave the best accuracy
(86.88%) of all the classifiers. Several supervised machine learning and deep learning algorithms
for categorizing Bengali news documents are discussed in [9]. A newer method for classifying Bangla
textual content was developed in [10]. The deep learning recurrent neural network (RNN) based
attention layer and the RNN with BiLSTM achieved accuracy rates of 97.72% and 86.56%, respectively.
The authors of [11] proposed a model structure named DCLSTM-MLP for the categorization of news text
documents, a customized algorithm that combines deep learning algorithms such as long short-term
memory (LSTM), convolutional neural network (CNN), and multi-layer perceptron (MLP). With this
model structure they tried to address the problems of text length, the complexity of extracting
features from news content, and the effective categorization of news text, and they achieved 94.82%
accuracy. However, the main issue of their paper is that the number of samples is small and the
distribution of the different types of news is unequal, which limits the model's effectiveness.


Journal homepage: http://telkomnika.uad.ac.id


TELKOMNIKA Telecommun Comput El Control o 585


Kowsher et al. [12] used several word embedding methods to embed the text from Bangla newspaper
data, as well as machine learning algorithms to categorize the embedded text. A deep recurrent
neural network was used to provide a new method of evaluating Bangla news articles in [13]. The
deep recurrent neural network featuring Bi-LSTM obtained 98.33% in Bengali text categorization,
which is higher than previous well-known classification techniques. For classifying text or
documents, many supervised techniques are used, such as KNN, naive Bayes (NB), DT, n-grams, and
neural networks (NNet). According to previous literature reviews, however, SVM [14] is the most
frequently used classifier. The authors of [15] proposed a news article classification framework
based on deep hybrid learning and compared it with traditional text classification; it outperforms
the standard method for classifying news texts in terms of overall performance. The authors of [16]
presented a comparative analysis of DT, KNN, NB, and Rocchio's algorithm, and reported that SVM
outperforms all the other classifiers. Beyond English documents, there has also been extensive
research on other languages. The authors of [17] focused on the classification of a self-created
Indonesian news corpus, which includes four separate categories and 472 Indonesian newspaper
articles from a variety of sources. They produced five models with ten epochs each, using 377
articles for training and 95 for testing, with CNN having the highest accuracy, about 90.74%, in
categorizing Indonesian news data. A study on an Indonesian news corpus [18] found that the
combination of TF-IDF and MNB beats other classification models such as multivariate Bernoulli
naive Bayes (BNB) and SVM, with an accuracy of 85%. For Arabic text classification, Arabic medical
text documents are classified in [19], where the authors used a rule-based classifier and obtained
an accuracy of 90.6%. Three classification algorithms, majority voting, ordered decision list, and
KNN, were used to validate the model.

Among the few works on Bangla, Chy et al. [20] worked on Bangla news classification using a naive
Bayes classifier. The problem with the naive Bayes model is that if the testing data set contains a
value of a categorical variable that was not part of the training data set, the model assigns it
zero probability and is unable to make a prediction. Nahar et al. [21] presented a comparative
analysis using a naive Bayes classifier, SVM, and neural networks to filter only Bangla sports and
political news from text data on online networks. Mandal and Sen [22] compared the performance of
four supervised learning techniques, decision tree, naive Bayes, k-nearest neighbor, and SVM, for
Bengali text categorization. They used TF-IDF to build a feature vector from 1000 web documents
containing 22,218 words. In work close to ours, using TF-IDF for feature selection and SVM as the
classifier, they attained a classification accuracy of 92.57% for 12 categories of Bangla text
files. They used 3191 text samples per category, whereas we use 11770 articles from the top 12 of
20 categories (shown in Table 1). To the best of our knowledge they used only SVM, whereas we use
both SVM and logistic regression (LR); LR is primarily used to classify observations into a
discrete number of categories, with rapid classification of unknown records. In this paper, we
present a comparative performance analysis of SVM and LR, which can offer a useful direction for
future researchers who wish to work with the Bangla language.

In this study, LR and SVM are used to classify the Bangla news articles, as the study in [23] shows
that logistic regression outperforms other techniques such as random forest and the k-nearest
neighbours algorithm. The approach classifies the Bangla news articles and labels each one with a
topic based on its contents. It extracts features from the corpus using the TF-IDF feature vector,
since TF-IDF makes it possible to extract relevant features while suppressing common words.


2. METHOD 

The aim of news classification is to assign categories according to the content of a news article.
As in English text classification, preprocessing of the Bangla news articles and feature extraction
are needed before training and model construction. The overall Bangla news article classification
process used in this experiment is illustrated in Figure 1.

Figure 1. System architecture of the proposed work


2.1. Data collection 

Despite the limited amount of data, Hossain et al. [24] found that a text graph convolutional
network (GCN) performed better than GRU-LSTM, BiLSTM, Char-CNN, LSTM, and bidirectional encoder
representations from transformers (BERT) in classifying online Bangla news. As there is still a
scarcity of standard data sets in the Bangla language, a data set was prepared by scraping news
articles from various electronic news sites such as 'https://www.prothomalo.com/'. At the time of
scraping, the articles were labeled with their categories. For the experiments, we collected around
12.5K labeled news articles in 20 categories. The top 12 categories, listed in Table 1, are
considered in this research work.


Table 1. Category-wise count of the Bangla news articles 


Serial  Category              Articles    Serial  Category                       Articles
1       Bangladesh news       5029        7       Feedback news                  578
2       Sports news           1387        8       Other news                     447
3       North America news    809         9       Citizen news                   414
4       Entertainment news    750         10      News of distance migrants      393
5       International news    746         11      Lifestyle news                 264
6       Economic news         711         12      Science and technology news    242


2.2. Data preprocessing

Preprocessing is performed to minimize the noise in the text, which helps to increase the
classifier's efficacy. The preprocessing steps clean the text data and prepare it for the machine
learning model. Since text documents contain sentences, that is, sequences of characters or words,
they must be broken into tokens to obtain the features. Tokenization is the process of breaking the
text down into sentences and then into words. After tokenization, symbols such as !, $, and %, as
well as numbers, which are not important for classification, are removed. Bangla stop-words are
eliminated at the same stage. The list of stop-words used in this research is given in [25].
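
The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration
under stated assumptions, not the implementation used in the study: the function name `preprocess`,
the two regular expressions, the whitespace tokenizer, and the five-word `STOP_WORDS` sample are all
hypothetical, and the study itself relies on the stop-word list from [25].

```python
import re

# Hypothetical five-word sample; the study uses the stop-word list from [25].
STOP_WORDS = {"এবং", "ও", "এই", "যে", "করে"}

# Everything outside the Bangla Unicode block (punctuation, symbols,
# Latin text, ASCII digits) is treated as noise.
NON_BANGLA = re.compile(r"[^\u0980-\u09FF\s]")
# Bangla digits live inside the Bangla block, so they need their own pattern.
BANGLA_DIGITS = re.compile(r"[\u09E6-\u09EF]")


def preprocess(text):
    """Strip symbols and numbers, tokenize on whitespace, drop stop-words."""
    text = NON_BANGLA.sub(" ", text)
    text = BANGLA_DIGITS.sub(" ", text)
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("বাংলাদেশ ও ভারত ২০২৩ সালে খেলবে!")
```

A whitespace tokenizer is the simplest possible choice here; a dedicated Bangla tokenizer could be
substituted without changing the rest of the sketch.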


2.3. Feature extraction 

Vectorization is the general process of converting a collection of text documents into numerical
feature vectors. A count vectorizer transforms an array of text documents into a matrix of token
counts, which contains the number of occurrences of each token in each document. This
implementation produces a sparse representation of the counts. The purpose of using TF-IDF instead
of the raw frequency of a token in a given document, on the other hand, is to scale down the
influence of tokens that occur very frequently in the corpus and that are therefore empirically
less informative than features that occur in a small fraction of the training corpus. TF-IDF is an
information retrieval method that weights the term frequency (TF) and the inverse document
frequency (IDF). Every word has its own TF and IDF scores. The TF-IDF weight is the product of a
term's TF and IDF scores. The rarer the word across the corpus, the higher its TF-IDF score, and
vice versa.
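
The two feature representations can be illustrated with a small standard-library sketch. It assumes
the plain textbook weighting, TF as the relative frequency of a token in a document and IDF as
log(N / df); library implementations often use smoothed or normalized variants, so the exact numbers
in the study may differ. The function names and the transliterated toy tokens are hypothetical.

```python
import math
from collections import Counter


def count_matrix(docs):
    """Return the sorted vocabulary and a document-by-token count matrix."""
    vocab = sorted({token for doc in docs for token in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[c[token] for token in vocab] for c in counts]


def tfidf_matrix(docs):
    """Weight each count by TF (relative frequency) times IDF (log(N / df))."""
    vocab, counts = count_matrix(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = []
    for row in counts:
        total = sum(row) or 1  # guard against empty documents
        weights.append([row[j] / total * idf[j] for j in range(len(vocab))])
    return vocab, weights


# Toy tokenized documents (transliterated placeholders, not real data).
docs = [["khela", "dol", "khela"], ["dol", "bazar"]]
vocab, counts = count_matrix(docs)
_, weights = tfidf_matrix(docs)
```

In this toy corpus the token "dol" appears in both documents, so its IDF is log(2/2) = 0 and its
TF-IDF weight vanishes, while the tokens that appear in only one document keep a positive weight.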


2.4. Classifiers 

In this study, two common classifiers are used: the SVM classifier with a linear kernel and
logistic regression. SVM has been used effectively in text classification. Logistic regression is a
statistical method of data analysis in which one or more independent variables are used to
determine an outcome.


2.4.1. SVM 

In essence, SVM is a supervised machine learning approach that acts as a binary classifier. It uses
a hyper-plane to classify data into two classes [25]. Support vectors are the data points closest to
the hyper-plane, and they influence its position and orientation. The support vectors are used to
maximize the margin of the classifier. Removing the support vectors would change the position of
the hyper-plane. Mathematically, SVM learns the following sign function [26],


$F(x) = \operatorname{sign}(w \cdot x + b)$ (1)


where $w$ is an $n$-dimensional weight vector in $\mathbb{R}^n$. By dividing the space
$\mathbb{R}^n$ into two half-spaces with the maximum margin, SVM finds the hyper-plane in (2).


$y = w \cdot x + b$ (2)


SVM is generally a binary classifier, but strategies such as one-vs-one and one-vs-rest can be used
to extend it to a multi-class classifier. In SVM, linear and radial basis function kernels are used
to make decisions. Since most text classification problems are linearly separable, the linear
kernel is preferred for text classification.
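
The sign function in (1) and the one-vs-rest strategy can be sketched as follows. The code only
illustrates prediction with already-trained linear models; the weights, biases, and category names
are hypothetical values chosen for the example, not parameters learned in the study, and the
convention of mapping a decision value of exactly zero to +1 is an assumption.

```python
def decision_value(w, x, b):
    """Linear decision value w . x + b, as in equation (2)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign_classify(w, x, b):
    """Binary prediction from equation (1): +1 on or above the hyper-plane, else -1."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def one_vs_rest_predict(models, x):
    """Pick the category whose binary model gives the largest decision value."""
    scores = {label: decision_value(w, x, b) for label, (w, b) in models.items()}
    return max(scores, key=scores.get)


# Hypothetical per-category (weights, bias) pairs for a two-feature input.
models = {
    "sports": ([1.0, -1.0], 0.0),
    "economy": ([-1.0, 1.0], 0.0),
    "science": ([-0.5, -0.5], 0.2),
}
predicted = one_vs_rest_predict(models, [2.0, 0.5])
```

One-vs-rest trains one binary model per category, so twelve categories need twelve models, whereas
one-vs-one would need a model for every pair of categories.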


2.4.2. Logistic regression 

Logistic regression uses a logistic function to estimate probabilities for the relationship between
one or more independent variables and the categorical dependent variable. The logistic function,
also known as the logistic curve, is a common "S"-shaped curve described by (3). The sigmoid curve
is another name for the logistic curve.


$F(x) = \dfrac{L}{1 + e^{-k(x - x_0)}}$ (3)


where $L$ is the curve's maximum value, $e$ is the base of the natural logarithm (Euler's number),
$k$ is the logistic growth rate or steepness of the curve, and $x_0$ is the $x$-value of the
sigmoid's midpoint.
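
As a quick numerical check of the logistic function in (3), the following minimal sketch evaluates
it directly. The function name `logistic` and the default values L = 1, k = 1, and midpoint 0 (the
standard sigmoid) are assumptions made for illustration.

```python
import math


def logistic(x, L=1.0, k=1.0, x0=0.0):
    """Logistic curve from equation (3): L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))


midpoint = logistic(0.0)      # half of the maximum value L at x = x0
high = logistic(4.0)          # approaches L for large x
low = logistic(-4.0)          # approaches 0 for very negative x
```

With L = 1 the output lies between 0 and 1, which is why the curve can be read as a class
probability in logistic regression.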

3. RESULTS AND DISCUSSION

The performance of the SVM classifier and logistic regression on our data set is discussed in this
section. A brief analysis of the data set is given first. The performance of SVM and LR on the
Bangla text data is then analyzed, and a comparison with similar works is given in Table 4.


3.1. Dataset analysis 

Of the roughly 12.5 thousand articles in 20 categories, we used the 11770 articles of the top 12
categories in this research work. A bar chart of the number of news articles in each category is
given in Figure 2. Figure 3 shows the word clouds generated from all the articles based on TF-IDF
and count vectorizer values. These were generated before running the preprocessing on the data.

Figure 2. Numbers of the Bangla news articles for each category


Figure 3. Word cloud for the articles used based on TF-IDF (left) and count vectorizer (right)


Many unnecessary words appear in the clouds, and the cloud based on TF-IDF differs from the cloud
based on count vectorizer values. Some words, however, have about the same importance under both
TF-IDF and count vectorizer values. We split the data into two parts: 80% of the data is used to
train the model, and the remaining 20% is used to test its performance.
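
The 80/20 split can be sketched as below. The source does not state whether the split was shuffled
or stratified by category, so the random shuffle, the fixed seed, and the function name
`train_test_split` are assumptions for illustration; integer indices stand in for the 11770
articles.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle the samples and hold out a fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integer indices stand in for the 11770 labeled articles.
train, test = train_test_split(range(11770))
```

Because the categories are strongly imbalanced (Table 1), a stratified split that preserves the
category proportions in both parts would be a reasonable alternative.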

3.2. Performance measures

In text categorization, a variety of evaluation metrics are employed. Precision, recall, and
F-measure are among the most widely used performance measures and are the ones considered in our
experiments. The precision, recall, and F-measure of each category are measured for SVM and LR. The
F-measure is averaged to assess performance across categories. Micro average and macro average are
the two types of average, and the macro average is used here. Table 2 shows how the two classifiers
performed on our corpus in terms of precision, recall, and F-measure. Table 3 compares the
accuracy, average precision, average recall, and average F1-score of the SVM and LR classifiers.
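
The per-category measures and their macro average can be computed as in the following sketch. The
function names and the four toy labels are hypothetical; the code shows only the standard
definitions of precision, recall, and F1-score, with the macro average taken as the unweighted mean
over categories.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 over all categories."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy labels, not data from the study.
y_true = ["sports", "sports", "economy", "economy"]
y_pred = ["sports", "economy", "economy", "economy"]
scores = per_class_scores(y_true, y_pred)
macro_precision, macro_recall, macro_f1 = macro_average(scores)
```

The macro average gives every category the same weight regardless of its size, so weak results on
small categories pull the average down more than they would under a micro average.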


Table 2. SVM and LR classifier results of 12 categories 


Categ.   Precision       Recall          F1-Score
         SVM     LR      SVM     LR      SVM     LR
0        0.89    0.84    0.96    0.96    0.92    0.90
1        0.94    0.91    0.98    0.98    0.96    0.95
2        0.74    0.78    0.72    0.68    0.73    0.73
3        0.88    0.84    0.87    0.83    0.88    0.84
4        0.70    0.65    0.68    0.61    0.69    0.63
5        0.81    0.80    0.78    0.73    0.79    0.76
6        0.69    0.64    0.74    0.69    0.71    0.66
7        0.77    0.78    0.78    0.76    0.77    0.77
8        0.51    0.50    0.21    0.13    0.29    0.21
9        0.59    0.60    0.49    0.41    0.53    0.49
10       0.88    0.93    0.61    0.49    0.72    0.64
11       0.69    0.68    0.57    0.43    0.62    0.53


It is observed from Table 3 that SVM with a linear kernel achieves an accuracy of 0.84 and LR
achieves 0.81, and that SVM outperforms LR in average precision, recall, and F1-score. Table 2
shows, however, that for some categories LR matches or outperforms SVM on individual measures; for
example, LR has higher precision for categories 2, 7, 9, and 10. Figure 4 shows the confusion
matrices for the SVM classifier with the linear kernel and for logistic regression, respectively.
Table 4 compares the findings of recent works that use the SVM classifier with the results of this
research. The comparison shows that our work, which combines TF-IDF and count vectorizer features
in the SVM classifier, achieves better accuracy than the other recent works that also used SVM.


Table 3. Accuracy, average precision, average recall and average F1-score result of SVM and LR classifier

Accuracy        Precision (Avg)   Recall (Avg)    F1-score (Avg)
SVM     LR      SVM     LR        SVM     LR      SVM     LR
0.84    0.81    0.76    0.75      0.70    0.64    0.72    0.68


Figure 4. Confusion-matrix for SVM (left) and LR (right)


Table 4. Comparison between this experiment and recent similar researches using SVM

Experiments           Feature used                 Accuracy
This experiment       TF-IDF + Countvectorizer     0.84
Islam et al. [27]     Trigram TF-IDF               0.80
Rahman et al. [28]    TF-IDF                       0.82
Yeasmin et al. [5]    TF-IDF                       0.83


4. CONCLUSION 

In the field of information systems, text categorization is an active research topic. In this
paper, we applied SVM and LR classifiers to a corpus we developed ourselves and measured their
performance. The results support the assumption that Bangla text can be classified correctly using
SVM and LR with limited resources. The outcome of the suggested approach is promising, although
higher precision could still be attained. Better outcomes might also have been obtained if we had
been able to handle all of the news categories and to use multiple categorization methods to reach
a constructive conclusion. In future work, we would like to add more categories and compare the
results using different classification algorithms.